NCHLT SiSwati POS tag set
For annotators' purposes, this tagset is largely adopted from
Taljard et al. (2008) and from various documents compiled
by G. Faasz and U. Heid (IMS, Stuttgart) and
D.J. Prinsloo and E. Taljard (University of Pretoria).
The information below reflects the current state of the tagset;
further development will probably necessitate changes.
The tagset is mainly based on the lexical and
morphological criteria defined by Lombard (1985) and Louwrens (1991). The
logical structure of the tagset is divided into two
layers of linguistic description (annotation levels):
The first annotation level (level 1) includes all mandatory (or,
in EAGLES terms, obligatory) information, namely up to three elements: an
element indicating the word class, a second one specifying functional or
syntactic properties, and a third one giving morphological specifics, e.g. PRO(noun)EMP(hatic)PERS(on).
The second level of annotation (level 2) includes recommended and
optional information. This level is in most cases used for a detailed
description of closed class items described in the tagger lexicon. Compare the
following excerpt:
Figure 1: Annotation levels
Description | Tag 1st level (mandatory information) | Tag 2nd level (optional/recommended information) |
Pronouns: | | |
emphatic personal | PROEMPPERS | 1sg, 2sg, 1pl, 2pl |
Verbals: | V | tr |
Morphemes: | | |
deficient | MORPH | def |
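The two annotation levels can be separated mechanically once a tag is written as a single string. The following is a minimal sketch, assuming that level 1 and level 2 are joined by an underscore (as in tags such as V_tr or PROEMPPERS_1sg listed further down); the names `Tag` and `split_tag` are hypothetical and not part of the tagset.

```python
from typing import NamedTuple, Optional


class Tag(NamedTuple):
    level1: str            # mandatory part, e.g. "PROEMPPERS"
    level2: Optional[str]  # optional/recommended part, e.g. "1sg"


def split_tag(tag: str) -> Tag:
    """Split a tag such as "V_tr" into its level-1 and level-2 parts.

    Assumption: the first underscore separates the two levels.
    """
    level1, _, level2 = tag.partition("_")
    if not level1:
        raise ValueError(f"tag has no level-1 part: {tag!r}")
    # An absent or empty level-2 part is reported as None.
    return Tag(level1, level2 or None)


if __name__ == "__main__":
    print(split_tag("PROEMPPERS_1sg"))
    print(split_tag("V"))
```

Because only level 1 is mandatory, a tag without an underscore is treated as complete rather than as an error.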
For disjunctive languages, all linguistic words are tagged in addition to
all orthographic words, resulting in two layers of POS
annotation: one for orthographic words and one for linguistic words.
For conjunctive languages, this extra layer of POS annotation is not needed.
The tagset currently distinguishes 20
categories applicable to SiSwati and two different levels of annotation.
However, only level 1 has been annotated. The first part of the tag gives a
general indication of the nature of the unit in question. These are as follows:
Tag | Explanation |
PUNC | Punctuation |
ABBR | Abbreviation (incl. acronyms) |
ADJ | Adjective (incl. enumerative) |
ADV | Adverb |
CDEM | Class-indicating demonstrative |
CONJ | Conjunction |
COP | Copulative (copulative subject concord, demonstrative copulative, copulative verb) |
FOR | Foreign |
IDEO | Ideophone |
INT | Interjection |
INTER | Question word |
N | Noun |
NPP | Place and brand name |
NUM | Numerative |
POSS | Possessive (possessive concord, possessive pronoun) |
PROEMP | Emphatic pronoun |
PROQUANT | Quantitative pronoun |
REL | Relative |
V | Verbal |
VAUX | Auxiliary verb |
Tags not applicable to SiSwati | |
ASP | Aspectual marker |
AUX | Auxiliary stem |
CN | Class-indicating nominal prefix |
CO | Class-indicating object concord |
CS | Class-indicating subject concord |
MNEG | Negative morpheme |
PART | Particle |
TENS | Tense marker |
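Several category names are prefixes of others (N and NPP, INT and INTER, V and VAUX), so mapping a full tag back to its category needs some care. The sketch below is one possible approach, not the project's own tooling: it picks the longest category name that the level-1 part starts with. The names `CATEGORIES` and `category_of` are hypothetical, and the function does not check whether the remainder of the tag is a valid class suffix.

```python
# The 20 categories applicable to SiSwati, transcribed from the table above.
CATEGORIES = {
    "PUNC": "Punctuation",
    "ABBR": "Abbreviation (incl. acronyms)",
    "ADJ": "Adjective (incl. enumerative)",
    "ADV": "Adverb",
    "CDEM": "Class-indicating demonstrative",
    "CONJ": "Conjunction",
    "COP": "Copulative",
    "FOR": "Foreign",
    "IDEO": "Ideophone",
    "INT": "Interjection",
    "INTER": "Question word",
    "N": "Noun",
    "NPP": "Place and brand name",
    "NUM": "Numerative",
    "POSS": "Possessive",
    "PROEMP": "Emphatic pronoun",
    "PROQUANT": "Quantitative pronoun",
    "REL": "Relative",
    "V": "Verbal",
    "VAUX": "Auxiliary verb",
}


def category_of(tag: str) -> str:
    """Return the category whose name is the longest prefix of the tag.

    Assumption: any level-2 part follows the first underscore and is ignored.
    """
    level1 = tag.split("_", 1)[0]
    matches = [name for name in CATEGORIES if level1.startswith(name)]
    if not matches:
        raise KeyError(f"no category matches tag {tag!r}")
    return max(matches, key=len)


if __name__ == "__main__":
    for example in ("N02a", "NPP", "INTER", "VAUX_tr", "PROEMPPERS_1sg"):
        print(example, "->", category_of(example))
```

Choosing the longest match keeps NPP from being read as a noun tag with an unusual suffix, while N02a and NLOC still resolve to N.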
Level 1: PUNC
; | PUNC |
( | PUNC |
! | PUNC |
“ | PUNC |
Level 1: ABBR
NGO | ABBR |
njll | ABBR |
Level 1: ADJ01-11, ADJ14-15, ADJ01a, ADJ02a, ADJLOC
munye | ADJ01 |
labasha | ADJ02 |
kuletinye | ADJLOC |
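The level-1 listings for class-indicating categories use a compact range notation such as ADJ01-11. The following sketch expands such a listing into explicit tags, assuming that a range covers every two-digit class number between its endpoints, inclusive; the function name `expand_tags` is hypothetical.

```python
import re
from typing import List

# Matches range items such as "ADJ01-11": prefix, first class, last class.
_RANGE = re.compile(r"([A-Z]+)(\d{2})-(\d{2})")


def expand_tags(spec: str) -> List[str]:
    """Expand a comma-separated level-1 listing into individual tags."""
    tags: List[str] = []
    for item in (part.strip() for part in spec.split(",")):
        match = _RANGE.fullmatch(item)
        if match:
            prefix = match.group(1)
            first, last = int(match.group(2)), int(match.group(3))
            # Assumption: the range is inclusive at both ends.
            tags.extend(f"{prefix}{n:02d}" for n in range(first, last + 1))
        elif item:
            # Items such as "ADJ01a" or "ADJLOC" are already single tags.
            tags.append(item)
    return tags


if __name__ == "__main__":
    print(expand_tags("ADJ01-11, ADJ14-15, ADJ01a, ADJ02a, ADJLOC"))
```

The same call works for the CDEM, N, POSS, PROEMP and PROQUANT listings, since they follow the same notation.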
Level 1: ADV, ADVLOC
ngeke | ADV |
ngaphandle | ADV |
ekhatsi | ADVLOC |
Level 1: CDEM01-11, CDEM14-15, CDEMLOC
laba | CDEM02 |
leyo | CDEM04 |
kulelo | CDEMLOC |
Level 1: CONJ
futsi | CONJ |
kepha | CONJ |
Level 1: COP
Level 2: COP_neg, COP_nil (-be, -bê and -bilê).
For the copulative verb stem -se, the tag COP_neg is used on level 2, as it is for the verb
stem -be (<-ba) when it is used in the negative form.
yincenye | COP |
likwati | COP |
Level 1: FOR
development | FOR |
planning | FOR |
Level 1: IDEO
mbamba | IDEO |
ngco | IDEO |
Level 1: INT
Level 2: INT_neg, INT_nil
hhayi | INT |
hawu | INT |
Level 1: INTER
Level 2: _man, _time, _loc, _N01a, _N02a
nini | INTER |
bani | INTER |
Level 1: N01-11, N14-15, N01a, N02a, NLOC, N00
Level 2: _aug, _dim, _loc, _name, _nil
cembu | N00 |
umuntfu | N01 |
bafati | N02 |
bomake | N02a |
lizinga | N05 |
budlelwane | N14 |
edolobheni | NLOC |
Level 1: NPP
Level 2: NPP_place, NPP_brand
KaZulu-Natali | NPP |
Mars | NPP |
Level 1: NUM
2.2 | NUM |
74(a) | NUM |
2005 | NUM |
Level 1: POSS01-11, POSS14-15, POSSLOC, POSSPERS, POSSKA
Level 2: POSSPERS_1pl, POSSPERS_2pl
wahulumende | POSS03 |
yato | POSS04 |
lwakhe | POSS11 |
Level 1: PROEMP01-11, PROEMP14-15, PROEMPLOC, PROEMPPERS
Level 2: PROEMPPERS_1sg, PROEMPPERS_1pl, PROEMPPERS_2sg, PROEMPPERS_2pl
wena | PROEMP03 |
yona | PROEMP09 |
kuto | PROEMPLOC |
Level 1: PROQUANT01-11, PROQUANT14-15, PROQUANTLOC
wonkhe | PROQUANT01 |
sonkhe | PROQUANT07 |
konkhe | PROQUANT15 |
Level 1: REL
esimeni | REL |
labacabene | REL |
Level 1: V
Level 2: V_tr, V_itr, V_dtr
kubona | V |
babelana | V |
kuhlelwa | V |
Level 1: VAUX
Level 2: VAUX_tr, VAUX_itr, VAUX_dtr
cishe | VAUX |
kube | VAUX |
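The level-2 values listed for the individual categories can be collected into a lookup table and used to check full tags. The sketch below is illustrative only: `LEVEL2_VALUES` is transcribed from the "Level 2" lines above, while the underscore separator for every category, the class-suffix pattern, and the function name `is_valid_level2` are assumptions.

```python
import re

# Level-2 values per level-1 base, transcribed from the "Level 2" lines.
LEVEL2_VALUES = {
    "COP": {"neg", "nil"},
    "INT": {"neg", "nil"},
    "INTER": {"man", "time", "loc", "N01a", "N02a"},
    "N": {"aug", "dim", "loc", "name", "nil"},
    "NPP": {"place", "brand"},
    "POSSPERS": {"1pl", "2pl"},
    "PROEMPPERS": {"1sg", "1pl", "2sg", "2pl"},
    "V": {"tr", "itr", "dtr"},
    "VAUX": {"tr", "itr", "dtr"},
}

# Assumption: a class suffix is two digits (optionally followed by "a") or "LOC".
_CLASS_SUFFIX = re.compile(r"(?:\d{2}a?|LOC)$")


def is_valid_level2(tag: str) -> bool:
    """Check that the level-2 part of a tag is listed for its level-1 base."""
    level1, sep, level2 = tag.partition("_")
    if not sep:
        return True  # level 2 is optional, so a bare level-1 tag passes
    base = _CLASS_SUFFIX.sub("", level1)  # e.g. "N02a" -> "N"
    return level2 in LEVEL2_VALUES.get(base, set())


if __name__ == "__main__":
    for example in ("V_tr", "N02a_dim", "PROEMPPERS_1sg", "ADV_tr"):
        print(example, is_valid_level2(example))
```

Since only level 1 has been annotated so far, a check like this would mainly be useful if level-2 annotation is added later.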